In this practical, a number of R packages are used. If any of them are not installed you may be able to follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:
mice (version: 3.13.0)miceadds (version: 3.11.6)lme4 (version: 3.1.152)JointAI (version: 1.0.2)splines (version: 4.0.5)ggplot2 (version: 3.3.3)You can find help files for any function by adding a ? before the name of the function.
Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html
The focus of this practical is the imputation of data that has features that require special attention.
In the interest of time, we will focus on these features and abbreviate steps that are the same as in any imputation setting (e.g., getting to know the data or checking that imputed values are realistic). Nevertheless, these steps are of course required when analysing data in practice.
For this practical, we will use the EP16dat4 dataset, which is a subset of data from a trial on primary biliary cirrhosis (PBC) of the liver.
To get the EP16dat4 dataset, load the file EP16dat4.RData. You can download it here.
To load this dataset into R, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file on your computer.
If you know the path to the file, you can also use load("<path>/EP16dat4.RData").
The variables contained in the dataset EP16dat4 are:
id
|
patient identifier |
day
|
continuously measured day of follow-up time (the time variable) |
sex
|
patients’ sex (f: female, m: male) |
trt
|
treatment group (0: D-penicillmain, 1: placebo) |
age
|
patients’ age at intake |
ascites
|
presence of ascites at baseline (0:no, 1:yes) |
hepato
|
presence of hepatomegaly or enlarged liver |
bili
|
serum bilirubin level at baseline |
copper
|
urine copper (ug/day) |
albumin
|
serum albumin level at follow-up (time-varying) |
The variables have the following distributions and proportions of missing values:
albumin, bili and hepato were measured repeatedly, day is the day of that measurement, and the other variables were only measured once, at baseline.
The missing data pattern is:
The longitudinal outcome albumin shows relatively linear trajectories over time:
To analyse the trajectories of albumin we want to use the following linear mixed effects model with random intercept and slope (either using lme from the package nlme or using lmer from the package lme4; and only AFTER we have imputed the data):
# using package nlme
lme(albumin ~ day + sex + trt + age + ascites + log(bili) + copper, random = ~day|id)
# using package lme4
lmer(albumin ~ day + sex + trt + age + ascites + log(bili) + copper + (day|id))With the current data, the approach to impute the data in wide format is not feasible, since the data are unbalanced, i.e., the measurements do not follow a pre-specified pattern, but different patients have different numbers of measurements and are measured at different time-points.
Theoretically, there is some functionality in mice for handling longitudinal data, using special imputation methods like "2l.norm", "2lonly.norm" or "2lonly.pmm", which take into account the multi-level structure of the data.
The package has undergone some recent updates with the result that we are unable to replicate the results we obtained before - data that was imputed now results in warning messages, and values are not being imputed. This may be a user error, due to changes in functionality, or it may be an issue with the functions itself.
Going forward we will perform imputation in this setting using an alternative approach.
The idea is to find a summary of the longitudinal variables that will capture the relevant characteristics of the longitudinal profiles so that as little information as possible is lost. This summary also has to be represented in the same number of values for each subject, and these values need to have the same interpretation for all subjects.
One option that fulfils these requirements is to fit mixed models to the longitudinal variables, and to represent the trajectories by the estimated random effects in the imputation (Erler et al., 2016). This approach, however, does not work for imputation of missing values in longitudinal variables. Hence, we will not consider the variable hepato here.
To get a good representation, the model needs to be flexible enough to fit the data sufficiently well.
We already saw how the traces for albumin look like, here is the corresponding plot for log(bili):
Our model of interest involves the longitudinal variables albumin and bili, which are both completely observed.
albumin and for log(bili).lmer() from the package lme4 is faster than lme() from nlme.
lmer(), you can set the argument control = lmerControl(optimizer = "bobyqa").
day and also include this spline specification in the random effects.ranef().
library("splines")
library("lme4")
# model for albumin
mod_albu0 <- lmer(albumin ~ age + sex + ns(day, df = 2) + (ns(day, df = 2) | id),
data = EP16dat4)
# model for log(bili)
mod_bili0 <- lmer(log(bili) ~ ns(day, df = 2) + (ns(day, df = 2) | id),
data = EP16dat4,
control = lmerControl(optimizer = "bobyqa"))
# extract the random effects
b_albu <- ranef(mod_albu0)$id
b_bili <- ranef(mod_bili0)$idhead(b_albu)## (Intercept) ns(day, df = 2)1 ns(day, df = 2)2
## 1 -0.444131 0.031104 0.269958
## 2 0.303353 -0.494036 -0.641426
## 3 0.046972 0.026262 -0.018399
## 4 -0.512325 -0.390684 -0.010675
## 5 -0.237390 -0.766961 -0.293498
## 6 0.440050 0.899370 0.472790
head(b_bili)## (Intercept) ns(day, df = 2)1 ns(day, df = 2)2
## 1 2.161133 2.03406 1.316344
## 2 -0.598690 0.40028 -0.531305
## 3 -0.287299 -0.15803 -0.168027
## 4 -0.046146 0.64818 0.061751
## 5 0.262130 2.43750 1.102044
## 6 -0.677375 -2.36292 -1.280326
To be able to distinguish between the random effects from the two models, we should give them different names:
names(b_albu) <- paste0("b_albu", 0:2)
names(b_bili) <- paste0("b_bili", 0:2)To check that our model fits the data suitably well, we can look at the observed and fitted values for a few cases:
# Combine the original data with the fitted values
plotDF <- cbind(EP16dat4,
fit_albu = fitted(mod_albu0),
fit_bili = fitted(mod_bili0)
)
# make a subset of 24 randomly chosen subjects
plotDF_sub <- subset(plotDF, id %in% sample(unique(EP16dat4$id), size = 30))
# For albumin:
ggplot(plotDF_sub, aes(x = day, y = albumin, group = id)) +
geom_point(alpha = 0.5, size = 1) +
geom_line(aes(y = fit_albu), color = 'blue') +
geom_point(aes(y = fit_albu), color = 'blue', alpha = 0.3) +
facet_wrap('id', ncol = 6) +
theme(panel.grid = element_blank())# For log(bili):
ggplot(plotDF_sub, aes(x = day, y = log(bili), group = id)) +
geom_point(alpha = 0.5, size = 1) +
geom_line(aes(y = fit_bili), color = 'blue') +
geom_point(aes(y = fit_bili), color = 'blue', alpha = 0.3) +
facet_wrap('id', ncol = 6) +
theme(panel.grid = element_blank())The next step is to get a wide-format version of our data. We do not need to worry about the longitudinal variables, instead we need to include the random effects as columns in this wide-format data.
match(unique(EP16dat4$id), EP16dat4$id) to select the first row for each subject.cbind() to add columns to a data.frame.
# combine the first row of each subject with the random effects
EP16dat_wide <- cbind(EP16dat4[match(unique(EP16dat4$id), EP16dat4$id), ],
b_albu, b_bili)
head(EP16dat_wide)## id trt age sex day ascites hepato bili albumin copper b_albu0 b_albu1 b_albu2 b_bili0
## 1 1 1 58.765 f 0 <NA> 1 14.5 2.60 156 -0.444131 0.031104 0.269958 2.161133
## 3 2 1 56.446 f 0 0 1 1.1 4.14 54 0.303353 -0.494036 -0.641426 -0.598690
## 12 3 1 70.073 m 0 0 0 1.4 3.48 NA 0.046972 0.026262 -0.018399 -0.287299
## 16 4 1 54.741 f 0 0 1 1.8 2.54 64 -0.512325 -0.390684 -0.010675 -0.046146
## 23 5 0 38.105 f 0 0 1 3.4 3.53 143 -0.237390 -0.766961 -0.293498 0.262130
## 29 6 0 66.259 f 0 0 1 0.8 3.98 50 0.440050 0.899370 0.472790 -0.677375
## b_bili1 b_bili2
## 1 2.03406 1.316344
## 3 0.40028 -0.531305
## 12 -0.15803 -0.168027
## 16 0.64818 0.061751
## 23 2.43750 1.102044
## 29 -2.36292 -1.280326
With these extra columns in the data, we can run the imputation using mice().
Perform the necessary steps to impute the data. Make sure you exclude unnecessary columns/variables from the imputation.
library("mice")
imp0 <- mice(EP16dat_wide, maxit = 0)## Warning: Number of logged events: 1
There is one logged event. This is only the set-up run, so it is not a problem, but we should be interested in what the problem might be to fix this in the actual imputation.
imp0$loggedEvents## it im dep meth out
## 1 0 0 constant day
The variable day is constant (because we used the first row for everyone, and the first measurement was taken on day 0 for all patients). We would exclude this variable from the imputation anyway since it is a longitudinal variable.
meth <- imp0$method
meth## id trt age sex day ascites hepato bili albumin copper b_albu0
## "" "" "" "" "" "logreg" "" "" "" "pmm" ""
## b_albu1 b_albu2 b_bili0 b_bili1 b_bili2
## "" "" "" "" ""
We do not need to change anything in the imputation method if we think that using pmm for copper is a good choice.
pred <- imp0$predictorMatrix
pred[, c('day', 'hepato', 'albumin', 'bili', 'id')] <- 0 # mice already set day to 0imp_mice <- mice(EP16dat_wide, method = meth, predictorMatrix = pred,
maxit = 20, seed = 2020, printFlag = FALSE)We now need to combine the imputed baseline data with the longitudinal variables in order to do our analysis.
mids object in long format and do not include the original data.split.
.imp refers to the imputation number.
data.frames in the list with the longitudinal variablesday) and the id variable.
mids object with the help of the function datalist2mids from the package miceadds.We first extract the data from the mids object. Here we use the option include = FALSE because we will later use the function datalist2mids(). If we would use as.mids() we would need to include the original data.
impdat <- complete(imp_mice, action = 'long', include = FALSE)We then split the data by imputation number. Then we merge each part of the data. We can conveniently wrap this into lapply(), because we want to apply the same function to each element of a list.
imp_list <- lapply(split(impdat, impdat$.imp), function(x) {
merge(subset(x, select = c(.imp, .id, id, trt, age, sex, ascites, copper)),
subset(EP16dat4, select = c(id, day, hepato, bili, albumin)),
all = TRUE)
})When converting the resulting list of longitudinal datasets to a mids object we would have (at least) two options: the function as.mids() and the function datalist2mids.
as.mids() does not work well with longitudinal data. It gives an error about duplicate row names. For that reason, we use datalist2mids.
mids_long <- miceadds::datalist2mids(imp_list)## Warning: Number of logged events: 1
Fit the mixed model of interest (as specified above) and obtain the pooled results.
You need to load the broom.mixed package (i.e., library("broom.mixed")) before pooling results from a mixed model.
lmer(albumin ~ ns(day, df = 2) + sex + trt + age + ascites + log(bili) + copper + (ns(day, df = 2)|id)
mod_mice <- with(mids_long,
lmer(albumin ~ ns(day, df = 2) + sex + trt + age + ascites +
log(bili) + copper + (ns(day, df = 2)|id),
control = lmerControl(optimizer = "bobyqa")))library("broom.mixed")
summary(pool(mod_mice), conf.int = TRUE)## term estimate std.error statistic df p.value 2.5 % 97.5 %
## 1 (Intercept) 4.20832527 0.11898217 35.36938 781.017 0.0000e+00 3.9747626 4.4418880
## 2 ns(day, df = 2)1 -0.81749358 0.06482659 -12.61047 1591.771 0.0000e+00 -0.9446480 -0.6903391
## 3 ns(day, df = 2)2 -0.57945461 0.09503884 -6.09703 1790.407 1.3205e-09 -0.7658533 -0.3930559
## 4 sexf -0.14529983 0.05988992 -2.42612 635.756 1.5539e-02 -0.2629058 -0.0276939
## 5 trt 0.02090007 0.03579124 0.58394 1556.818 5.5934e-01 -0.0493041 0.0911042
## 6 age -0.00790092 0.00176303 -4.48143 1885.037 7.8597e-06 -0.0113586 -0.0044432
## 7 ascites1 -0.20739674 0.07845242 -2.64360 250.790 8.7199e-03 -0.3619063 -0.0528872
## 8 log(bili) -0.16106260 0.01485340 -10.84349 223.805 0.0000e+00 -0.1903330 -0.1317922
## 9 copper -0.00055242 0.00032133 -1.71913 20.611 1.0057e-01 -0.0012214 0.0001166
To analyse incomplete longitudinal data using a linear mixed model the R package JointAI provides the functions lme_imp() and lmer_imp(). The specification of the main model components is analogous to the function lme() from the nlme package or lmer() from the lme4 package.
Specification of longitudinal models:
When imputing variables in a longitudinal (or other multi-level) model and there are missing values in baseline (level-2) covariates, models need to be specified for all longitudinal covariates, even if they do not have missing values. Specifying no model would imply that the incomplete baseline covariates are independent of the complete longitudinal variable (see also here). Therefore, JointAI automatically specifies models for all longitudinal covariates in such a setting.
An exception may be the time variable: it is often reasonable to assume that the baseline covariates are independent of the measurement times of the outcome and longitudinal covariates. To tell JointAI not to specify a model for the time variable, the argument no_model can be used.
| name | model | variable type |
|---|---|---|
lmm
|
linear mixed model | continuous variables |
glmm_lognorm
|
log-normal mixed model | skewed continuous variables |
glmm_beta
|
beta mixed model | continuous variables in [0, 1] |
glmm_logit
|
logistic mixed model | factors with two levels |
glmm_probit
|
probit mixed model | factors with two levels |
glmm_gamma_inverse
|
gamma mixed model (with inverse link) | skewed continuous variables |
glmm_gamma_identity
|
gamma mixed model (with identity link) | skewed continuous variables |
glmm_gamma_log
|
gamma mixed model (with log link) | skewed continuous variables |
glmm_poisson_log
|
poisson mixed model (with log link) | count variables |
glmm_poisson_identity
|
poisson mixed model (with identity link) | count variables |
clmm
|
cumulative logit mixed model | ordered factors with >2 levels |
mlogitmm
|
multinomial logit mixed model | unordered factors with >2 levels |
More info:
For the specification of the other arguments of lme_imp(), refer to
Run the imputation (start with n.iter = 500; this will take a few seconds).
lme_imp() with n.adapt = 0 and n.iter = 0, and extract the vector of model types using <mymodel>$models, where <mymodel> is the name you gave the model.
day.traceplot().use_ggplot = TRUE in traceplot().
library("JointAI")
JointAIlong <- lme_imp(albumin ~ ns(day, df = 2) + sex + trt + age + ascites +
bili + copper, random = ~ns(day, df = 2)|id,
models = c(copper = 'lognorm', bili = 'glmm_lognorm'),
no_model = 'day', data = EP16dat4, n.iter = 500, seed = 2020)traceplot(JointAIlong, use_ggplot = TRUE) +
theme(legend.position = 'none',
panel.grid = element_blank())The traceplot shows that there is some autocorrelation (i.e., values are very similar to the previous value) in some of the chains (for example, ns(day, df = 2)2).
To continue the sampling in order to run the model for longer, we can use the function add_samples(). The number of additional iterations can be specified using the argument n.iter.
traceplot() again.GR_crit())MC_error()).JointAIlong_extra <- add_samples(JointAIlong, n.iter = 1000)traceplot(JointAIlong_extra, use_ggplot = TRUE) +
theme(legend.position = 'none',
panel.grid = element_blank())The traceplots now look a bit better.
By setting the argument autoburnin = FALSE in GR_crit() we select to use the full MCMC chains. By default (autoburnin = TRUE) the function automatically discards the first half of the iterations and only uses the second half to evaluate convergence.
GR_crit(JointAIlong_extra, autoburnin = FALSE)## Potential scale reduction factors:
##
## Point est. Upper C.I.
## (Intercept) 1.00 1.00
## sexf 1.00 1.00
## trt1 1.00 1.00
## age 1.00 1.00
## ascites1 1.01 1.03
## copper 1.00 1.00
## ns(day, df = 2)1 1.01 1.02
## ns(day, df = 2)2 1.06 1.19
## bili 1.00 1.02
## sigma_albumin 1.00 1.00
## D_albumin_id[1,1] 1.01 1.03
## D_albumin_id[1,2] 1.01 1.04
## D_albumin_id[2,2] 1.01 1.03
## D_albumin_id[1,3] 1.05 1.17
## D_albumin_id[2,3] 1.01 1.03
## D_albumin_id[3,3] 1.01 1.02
##
## Multivariate psrf
##
## 1.12
For some parameters, the criterion is still too large. The traceplot indicates that this is not really an issue of convergence, but slow mixing: because of the high correlation between subsequent samples, the chain only moves very slowly. As a result, there is more variation between the chains that within one chain.
We would have to run the model even longer to give the chains the “time” to fully explore the full range of the posterior distribution.
MC_error(JointAIlong_extra)## est MCSE SD MCSE/SD
## (Intercept) 4.19683 3.5e-03 0.12529 0.028
## sexf -0.12265 1.9e-03 0.06305 0.031
## trt1 0.00978 9.4e-04 0.03795 0.025
## age -0.00736 6.1e-05 0.00193 0.031
## ascites1 -0.27734 2.7e-03 0.08791 0.031
## copper -0.00092 1.3e-05 0.00028 0.046
## ns(day, df = 2)1 -0.91264 3.9e-03 0.06844 0.057
## ns(day, df = 2)2 -0.70565 1.5e-02 0.12384 0.122
## bili -0.02413 6.5e-05 0.00246 0.026
## sigma_albumin 0.31713 1.8e-04 0.00592 0.030
## D_albumin_id[1,1] 0.07738 7.1e-04 0.01104 0.065
## D_albumin_id[1,2] -0.02761 1.8e-03 0.02368 0.075
## D_albumin_id[2,2] 0.37158 6.2e-03 0.08518 0.073
## D_albumin_id[1,3] -0.06304 4.0e-03 0.03755 0.106
## D_albumin_id[2,3] 0.25605 1.2e-02 0.10750 0.113
## D_albumin_id[3,3] 0.43788 2.9e-02 0.21158 0.138
The Monte Carlo error is also still larger then we would want it to be.
Based on the Gelman-Rubin criterion and the Monte Carlo error we need more MCMC samples to get precise results. We can either run the MCMC chains for longer, but since the issue is not that the chains do not have converged, we could also get more samples by increasing the number of MCMC chains.
Using more MCMC chains has the advantage that different chains are independent and that they can be run in parallel.
Note: On computers with multiple CPUs it is possible to run the different MCMC chains of a model in parallel to save some time.
This can be achieved using the doFuture package. For details see the vignette on MCMC settings.
We want to fit a logistic mixed model for the variable hepato and explore if the association is non-linear over time.
glme_imp using the same covariates as before.day.traceplot().ns() from the package splines, i.e., ns(day, df = 3).
library("splines")
JointAIlong2 <- glme_imp(hepato ~ ns(day, df = 3) + sex + trt + age + ascites +
bili + copper, random = ~1|id, family = binomial(),
models = c(copper = 'lognorm', bili = 'glmm_gamma_log'),
no_model = 'day', data = EP16dat4, n.iter = 1000,
seed = 2020)##
## The variables "hepato", "trt" were converted to factors.
traceplot(JointAIlong2, use_ggplot = TRUE) +
theme(legend.position = 'none',
panel.grid = element_blank())When the model has converged, we want to visualize the potentially non-linear association of day. To do that, we can create a new dataset containing information on an “average” subject, with different values for day.
The function predDF() creates such a dataset from an object of class JointAI. It sets reference values (i.e., the median for continuous variables and the reference category for categorical variables) for all variables other than the ones specified in the argument var (a one-sided formula). The variables given in var will range across the range of values of that variable encountered in the data.
Use predDF() to create a dataset that allows visualization of the effect of day.
newdf <- predDF(JointAIlong2, var = ~day)
head(newdf)## hepato day sex trt age ascites bili copper id
## 1 0 0.00 m 0 48.871 0 1.4 64 126
## 2 0 52.04 m 0 48.871 0 1.4 64 126
## 3 0 104.08 m 0 48.871 0 1.4 64 126
## 4 0 156.12 m 0 48.871 0 1.4 64 126
## 5 0 208.16 m 0 48.871 0 1.4 64 126
## 6 0 260.20 m 0 48.871 0 1.4 64 126
We can now predict the outcome of our model for our “average” subject using the function predict(). It takes a JointAI object and a data.frame containing the data to predict from as arguments. The argument quantiles can be used to specify which quantiles of the distribution of each fitted value are returned (default is 2.5% and 97.5%).
predict() returns a list with the following elements
dat: the data.frame provided by the user extended with the fitted values and 2.5% and 97.5% quantiles that form the credible interval for the fitted valuesfit: a vector containing the fitted values (the mean of the distribution of the fitted value)quantiles: a matrix containing the credible interval for each fitted valuepredict() to obtain the fitted values and corresponding intervalsday; x-axis)pred <- predict(JointAIlong2, newdata = newdf, type = "response")## Warning:
## Prediction in multi-level settings currently only takes into account the fixed effects,
## i.e., assumes that the random effect realizations are equal to zero.
ggplot(pred$newdata, aes(x = day, y = fit)) +
geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`), alpha = 0.3) +
geom_line() +
theme(panel.grid = element_blank()) +
ylab("expected probability")© Nicole Erler